{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/embeddings/jinaai_embeddings.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Jina Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-embeddings-jinaai\n",
    "%pip install llama-index-llms-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You may also need other packages that do not come direcly with llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install Pillow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example, you will need an API key which you can get from https://jina.ai/embeddings/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initilise with your api key\n",
    "import os\n",
    "\n",
    "jinaai_api_key = \"YOUR_JINAAI_API_KEY\"\n",
    "os.environ[\"JINAAI_API_KEY\"] = jinaai_api_key"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embed text and queries with Jina embedding models through JinaAI API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can encode your text and your queries using the JinaEmbedding class. Jina offers a range of models adaptable to various use cases.\n",
    "\n",
    "|  Model | Dimension  |  Language |  MRL (matryoshka) | Context |\n",
    "|:----------------------:|:---------:|:---------:|:-----------:|:---------:|\n",
    "|  jina-embeddings-v3  |  1024 | Multilingual (89 languages)  |  Yes  | 8192 |\n",
    "|  jina-embeddings-v2-base-en |  768 |  English |  No | 8192  | \n",
    "|  jina-embeddings-v2-base-de |  768 |  German & English |  No  |  8192 | \n",
    "|  jina-embeddings-v2-base-es |  768 |  Spanish & English |  No  |  8192 | \n",
    "|  jina-embeddings-v2-base-zh | 768  |  Chinese & English |  No  |  8192 | \n",
    "\n",
    "**Recommended Model: jina-embeddings-v3 :**\n",
    "\n",
    "We recommend `jina-embeddings-v3` as the latest and most performant embedding model from Jina AI. This model features 5 task-specific adapters trained on top of its backbone, optimizing various embedding use cases.\n",
    "\n",
    "By default `JinaEmbedding` class uses `jina-embeddings-v3`. On top of the backbone, `jina-embeddings-v3` has been trained with 5 task-specific adapters for different embedding uses.\n",
    "\n",
    "**Task-Specific Adapters:**\n",
    "\n",
    "Include `task` in your request to optimize your downstream application:\n",
    "\n",
    "+ **retrieval.query**: Used to encode user queries or questions in retrieval tasks.\n",
    "+ **retrieval.passage**: Used to encode large documents in retrieval tasks at indexing time.\n",
    "+ **classification**: Used to encode text for text classification tasks.\n",
    "+ **text-matching**: Used to encode text for similarity matching, such as measuring similarity between two sentences.\n",
    "+ **separation**: Used for clustering or reranking tasks.\n",
    "\n",
    "\n",
    "**Matryoshka Representation Learning**:\n",
    "\n",
    "`jina-embeddings-v3` supports Matryoshka Representation Learning, allowing users to control the embedding dimension with minimal performance loss.  \n",
    "Include `dimensions` in your request to select the desired dimension.  \n",
    "By default, **dimensions** is set to 1024, and a number between 256 and 1024 is recommended.  \n",
    "You can reference the table below for hints on dimension vs. performance:\n",
    "\n",
    "\n",
    "|         Dimension          | 32 |  64  | 128 |  256   |  512   |   768 |  1024   | \n",
    "|:----------------------:|:---------:|:---------:|:-----------:|:---------:|:----------:|:---------:|:---------:|\n",
    "|  Average Retrieval Performance (nDCG@10)   |   52.54     | 58.54 |    61.64    | 62.72 | 63.16  | 63.3  |   63.35    | \n",
    "\n",
    "**Late Chunking in Long-Context Embedding Models**\n",
    "\n",
    "`jina-embeddings-v3` supports [Late Chunking](https://jina.ai/news/late-chunking-in-long-context-embedding-models/), the technique to leverage the model's long-context capabilities for generating contextual chunk embeddings. Include `late_chunking=True` in your request to enable contextual chunked representation. When set to true, Jina AI API will concatenate all sentences in the input field and feed them as a single string to the model. Internally, the model embeds this long concatenated string and then performs late chunking, returning a list of embeddings that matches the size of the input list. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.jinaai import JinaEmbedding\n",
    "\n",
    "text_embed_model = JinaEmbedding(\n",
    "    api_key=jinaai_api_key,\n",
    "    model=\"jina-embeddings-v3\",\n",
    "    # choose `retrieval.passage` to get passage embeddings\n",
    "    task=\"retrieval.passage\",\n",
    ")\n",
    "\n",
    "embeddings = text_embed_model.get_text_embedding(\"This is the text to embed\")\n",
    "print(\"Text dim:\", len(embeddings))\n",
    "print(\"Text embed:\", embeddings[:5])\n",
    "\n",
    "query_embed_model = JinaEmbedding(\n",
    "    api_key=jinaai_api_key,\n",
    "    model=\"jina-embeddings-v3\",\n",
    "    # choose `retrieval.query` to get query embeddings, or choose your desired task type\n",
    "    task=\"retrieval.query\",\n",
    "    # `dimensions` allows users to control the embedding dimension with minimal performance loss. by default it is 1024.\n",
    "    # A number between 256 and 1024 is recommended.\n",
    "    dimensions=512,\n",
    ")\n",
    "\n",
    "embeddings = query_embed_model.get_query_embedding(\n",
    "    \"This is the query to embed\"\n",
    ")\n",
    "print(\"Query dim:\", len(embeddings))\n",
    "print(\"Query embed:\", embeddings[:5])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embed images and queries with Jina CLIP through JinaAI API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also encode your images and your queries using the JinaEmbedding class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.jinaai import JinaEmbedding\n",
    "from PIL import Image\n",
    "import requests\n",
    "from numpy import dot\n",
    "from numpy.linalg import norm\n",
    "\n",
    "embed_model = JinaEmbedding(\n",
    "    api_key=jinaai_api_key,\n",
    "    model=\"jina-clip-v1\",\n",
    ")\n",
    "\n",
    "image_url = \"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcStMP8S3VbNCqOQd7QQQcbvC_FLa1HlftCiJw&s\"\n",
    "im = Image.open(requests.get(image_url, stream=True).raw)\n",
    "print(\"Image:\")\n",
    "display(im)\n",
    "\n",
    "image_embeddings = embed_model.get_image_embedding(image_url)\n",
    "print(\"Image dim:\", len(image_embeddings))\n",
    "print(\"Image embed:\", image_embeddings[:5])\n",
    "\n",
    "text_embeddings = embed_model.get_text_embedding(\n",
    "    \"Logo of a pink blue llama on dark background\"\n",
    ")\n",
    "print(\"Text dim:\", len(text_embeddings))\n",
    "print(\"Text embed:\", text_embeddings[:5])\n",
    "\n",
    "cos_sim = dot(image_embeddings, text_embeddings) / (\n",
    "    norm(image_embeddings) * norm(text_embeddings)\n",
    ")\n",
    "print(\"Cosine similarity:\", cos_sim)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embed in batches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also embed text in batches, the batch size can be controlled by setting the `embed_batch_size` parameter (the default value will be 10 if not passed, and it should not be larger than 2048)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embed_model = JinaEmbedding(\n",
    "    api_key=jinaai_api_key,\n",
    "    model=\"jina-embeddings-v3\",\n",
    "    embed_batch_size=16,\n",
    "    task=\"retrieval.passage\",\n",
    ")\n",
    "\n",
    "embeddings = embed_model.get_text_embedding_batch(\n",
    "    [\"This is the text to embed\", \"More text can be provided in a batch\"]\n",
    ")\n",
    "\n",
    "print(len(embeddings))\n",
    "print(embeddings[0][:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's build a RAG pipeline using Jina AI Embeddings"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import sys\n",
    "\n",
    "logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
    "logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
    "\n",
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.core.response.notebook_utils import display_source_node\n",
    "\n",
    "from IPython.display import Markdown, display"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Build index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "your_openai_key = \"YOUR_OPENAI_KEY\"\n",
    "llm = OpenAI(api_key=your_openai_key)\n",
    "embed_model = JinaEmbedding(\n",
    "    api_key=jinaai_api_key,\n",
    "    model=\"jina-embeddings-v3\",\n",
    "    embed_batch_size=16,\n",
    "    task=\"retrieval.passage\",\n",
    ")\n",
    "\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents=documents, embed_model=embed_model\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Build retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_query_retriever = index.as_retriever()\n",
    "\n",
    "search_query_retrieved_nodes = search_query_retriever.retrieve(\n",
    "    \"What happened after the thesis?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for n in search_query_retrieved_nodes:\n",
    "    display_source_node(n, source_length=2000)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  },
  "vscode": {
   "interpreter": {
    "hash": "64bcadabe4cd61f3d117ba0da9d14bf2f8e35582ff79e821f2e71056f2723d1e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
